Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Identifying differentially expressed genes is difficult because of the small number of available samples compared with the large number of genes. Conventional gene selection methods employing statistical tests have the critical problem that P-values depend heavily on sample size. Although the recently proposed principal component analysis (PCA)- and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why it works so well is unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP), which was proposed a long time ago to solve the problem of dimensions; we can relate the space spanned by singular value vectors to that spanned by the optimal cluster centroids obtained from K-means. Thus, the success of PCA- and TD-based unsupervised FE can be understood through this equivalence. In addition, the empirical threshold of 0.01 for adjusted P-values, under the null hypothesis that singular value vectors attributed to genes obey a Gaussian distribution, corresponds empirically to a threshold of 0.1 for adjusted P-values when the null distribution is generated by gene order shuffling. For this purpose, we newly applied PP to the three data sets to which PCA- and TD-based unsupervised FE were previously applied; these data sets address two topics, biomarker identification for kidney cancers (the first two) and drug discovery for COVID-19 (the third one). We found that PP coincides well with PCA- or TD-based unsupervised FE. The shuffling procedures described above were also successfully applied to these three data sets. These findings thus rationalize the success of PCA- and TD-based unsupervised FE for the first time.
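The selection rule mentioned in the abstract (an adjusted P-value threshold of 0.01 under a Gaussian null for the components of a singular value vector) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the chi-squared form of the test, the use of the root mean square about zero as the scale, and the Benjamini-Hochberg adjustment are assumptions, and the data are synthetic.

```python
import math
import random


def gaussian_null_pvalues(u):
    """P-values for the components of one singular vector under a zero-mean
    Gaussian null, i.e. a chi-squared test (1 d.o.f.) on (u_i / sigma)^2."""
    # Assumption: sigma is the root mean square of the components about zero.
    sigma = math.sqrt(sum(x * x for x in u) / len(u))
    # For 1 degree of freedom, P[chi2 > z^2] = erfc(|z| / sqrt(2)).
    return [math.erfc(abs(x) / sigma / math.sqrt(2.0)) for x in u]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(u, threshold=0.01):
    """Indices of genes whose adjusted P-value falls below the threshold."""
    adjusted = benjamini_hochberg(gaussian_null_pvalues(u))
    return [i for i, p in enumerate(adjusted) if p < threshold]


# Synthetic example (not from the paper): 1000 null genes plus 5 outliers.
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(1000)]
u += [12.0, -11.0, 10.0, -13.0, 12.5]
selected = select_genes(u)
```

In this toy setting only the five appended outlier components are expected to pass the threshold. The shuffling-based null mentioned in the abstract is not sketched here, because the abstract does not specify its details.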

Is the paper well organized?
The paper is well organized. The good literature review, suitable motivation, and clear explanation of the results are its positive points.
Is the abstract concise?
Yes, but I think it should be rephrased after revision to mention any artifacts or drawbacks of the method, if any exist.

Is the introduction motivating?
Yes, the Introduction section is motivating.
Are the methodology, results, and conclusions completely developed?
No, they need to be modified and developed according to the technical comments.
Are there language, mathematics, reference, or style errors?
There are no mathematical, reference, or style errors.

Technical Comments:
Is the code for this research available? As far as I could find, no code is available for this study, e.g., on GitHub. If the authors could make the code available, the manuscript could be evaluated much better, not only by reviewers but also by prospective readers. If it is not possible to upload the code for public access, such as on GitHub, could it be provided to the reviewers for a better assessment of the study?
The study is comprehensive and requires considerable time to read carefully and review. The theoretical background is explained well and in detail, the experiments and related models are presented, and the algorithm in Fig. 1 is also well presented. I think more explanation of the steps and the parameters in Fig. 1 is required.
The result comparison parts are well organized and presented, and the way the results are displayed is good. However, there is so much quantitative evaluation that the reader can get lost in it. I think it would be better to add more explanation of it.
How did you evaluate the final result? How did you decide on the final choice of methodology for the most complicated problem?
What about when the models are more complex?
The introduction section is a nice one. It is structured very well and written in a fully academic and comprehensible style. I do not think any change to the introduction is necessary, but one of the important tasks after publishing a study is to increase its chance of being seen by as many researchers as possible, so I would like to make two recommendations. First, so that your published study appears in keyword-based searches, I propose increasing the variety of your keywords. In my view, they do not cover the whole topic of the study and are not widely searched terms. I propose adding at least the keyword "data analysis". Second, one of the ways the publisher's website brings a publication to researchers' attention is based on similar publications that they have read before. So, the more similar publications you cite, the greater the chance that the search engine on the publisher's website proposes your paper to a researcher. Besides that, it would also complete your introduction section. As another advantage, it gives researchers new ideas by combining various methods, resolving a drawback of one paper by reading a similar one, or extending the methodology to a fully automatic one. Based on these points, I would like to ask you to cite the following similar publications, which used PCA and feature selection for deep learning, but in a different field of study. The first proposed publication is: Shahbazi, A., Soleimani Monfared, M., Thiruchelvam, V., Ka Fei, T., Babasafari, A.A., (2020). Integration of knowledge-based seismic inversion and sedimentological investigations for heterogeneous reservoir. Journal of Asian Earth Sciences. The second publication for citation is: Khayer, K., Kahoo, A.R., Soleimani Monfared, M., Tokhmechi, B., and Kavousi, K., (2022). Target-Oriented Fusion of Attributes in Data Level for Salt Dome Geobody Delineation in Seismic Data. Natural Resources Research. The third publication could be: Khayer, K., Kahoo, A.R., Soleimani Monfared, M., and Kavousi, K., (2022). Combination of seismic attributes using graph-based methods to identify the salt dome boundary. Journal of Petroleum Science and Engineering. 215, Part A, 110625.
The abstract focuses mainly on the general problem and ignores the other items an abstract should cover, such as the methodology, a good introduction, results, and conclusion.